Matthew Naeher

Introduction

In this report, we will be visualizing baby name data from the US over the past century. The data for this project has been provided by the United States Social Security Administration. With this data, Pandas dataframes will be used to organize name counts by state, year, and sex. Using the dataframe, the data will then be visualized using Plotly Express. The two types of plots used are line graphs and choropleth maps. The line graphs will illustrate a change in name frequency or diversity over time, while the choropleth maps will allow this data to be visualized on a state-by-state basis.

Using this dataset, we will explore three topics:

  1. Popularity of the name "Matthew" over time
  2. Name diversity over time
  3. The top name in each state per year

Preliminaries

The code below imports the necessary packages (Plotly Express, pandas, glob, zipfile, and requests) and unzips the dataset.

In [1]:
import plotly.express as px
In [2]:
from glob import glob

from zipfile import ZipFile
import requests
import pandas as pd

# When I tried unzipping namesbystate.zip the way the course website described, it created a gpgz file that I could not open.
# As a workaround, I passed the unzip command to the shell, which extracts the individual text files into the working directory.
# I apologize for the inconvenience, but this was the only way I could get the code working.
!unzip namesbystate.zip 
Archive:  namesbystate.zip
  inflating: AK.TXT                  
  inflating: AL.TXT                  
  inflating: AR.TXT                  
  inflating: AZ.TXT                  
  inflating: CA.TXT                  
  inflating: CO.TXT                  
  inflating: CT.TXT                  
  inflating: DC.TXT                  
  inflating: DE.TXT                  
  inflating: FL.TXT                  
  inflating: GA.TXT                  
  inflating: HI.TXT                  
  inflating: IA.TXT                  
  inflating: ID.TXT                  
  inflating: IL.TXT                  
  inflating: IN.TXT                  
  inflating: KS.TXT                  
  inflating: KY.TXT                  
  inflating: LA.TXT                  
  inflating: MA.TXT                  
  inflating: MD.TXT                  
  inflating: ME.TXT                  
  inflating: MI.TXT                  
  inflating: MN.TXT                  
  inflating: MO.TXT                  
  inflating: MS.TXT                  
  inflating: MT.TXT                  
  inflating: NC.TXT                  
  inflating: ND.TXT                  
  inflating: NE.TXT                  
  inflating: NH.TXT                  
  inflating: NJ.TXT                  
  inflating: NM.TXT                  
  inflating: NV.TXT                  
  inflating: NY.TXT                  
  inflating: OH.TXT                  
  inflating: OK.TXT                  
  inflating: OR.TXT                  
  inflating: PA.TXT                  
  inflating: RI.TXT                  
  inflating: SC.TXT                  
  inflating: SD.TXT                  
  inflating: StateReadMe.pdf         
  inflating: TN.TXT                  
  inflating: TX.TXT                  
  inflating: UT.TXT                  
  inflating: VA.TXT                  
  inflating: VT.TXT                  
  inflating: WA.TXT                  
  inflating: WI.TXT                  
  inflating: WV.TXT                  
  inflating: WY.TXT                  
In [3]:
#Find all files of type txt
glob('*.TXT')
Out[3]:
['IN.TXT',
 'IL.TXT',
 'KS.TXT',
 'SC.TXT',
 'HI.TXT',
 'GA.TXT',
 'SD.TXT',
 'CO.TXT',
 'NH.TXT',
 'MS.TXT',
 'MD.TXT',
 'UT.TXT',
 'LA.TXT',
 'ME.TXT',
 'WI.TXT',
 'NJ.TXT',
 'AR.TXT',
 'NY.TXT',
 'MT.TXT',
 'OK.TXT',
 'MA.TXT',
 'NM.TXT',
 'WY.TXT',
 'OH.TXT',
 'OR.TXT',
 'NV.TXT',
 'TX.TXT',
 'TN.TXT',
 'AZ.TXT',
 'MN.TXT',
 'WA.TXT',
 'WV.TXT',
 'NC.TXT',
 'MO.TXT',
 'AL.TXT',
 'VA.TXT',
 'CA.TXT',
 'CT.TXT',
 'AK.TXT',
 'ND.TXT',
 'VT.TXT',
 'MI.TXT',
 'NE.TXT',
 'KY.TXT',
 'ID.TXT',
 'DC.TXT',
 'IA.TXT',
 'FL.TXT',
 'PA.TXT',
 'RI.TXT',
 'DE.TXT']

Below, the data from each individual txt file is concatenated into a single pandas dataframe.

In [4]:
file_names = glob('*.TXT')

df = pd.concat(
    (pd.read_csv(f, names=['state', 'sex', 'year', 'name', 'count']) for f in file_names)
).reset_index(drop=True)

df.head()
Out[4]:
state sex year name count
0 IN F 1910 Mary 619
1 IN F 1910 Helen 324
2 IN F 1910 Ruth 238
3 IN F 1910 Dorothy 215
4 IN F 1910 Mildred 200
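
As an alternative to unzipping through the shell and then globbing for the text files, the archive members could be read directly with the `zipfile` module. The following is a minimal sketch, not the method used in this report: `read_names_zip` and `COLUMNS` are hypothetical names, and the tiny in-memory archive with made-up rows only stands in for namesbystate.zip.

```python
import io
from zipfile import ZipFile

import pandas as pd

COLUMNS = ["state", "sex", "year", "name", "count"]

def read_names_zip(zip_source):
    """Concatenate every .TXT member of a zip archive into one dataframe."""
    frames = []
    with ZipFile(zip_source) as archive:
        # Skip non-data members such as the PDF readme
        members = sorted(m for m in archive.namelist() if m.upper().endswith(".TXT"))
        for member in members:
            with archive.open(member) as handle:
                frames.append(pd.read_csv(handle, names=COLUMNS))
    return pd.concat(frames, ignore_index=True)

# Build a tiny in-memory archive with made-up rows for demonstration
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("AK.TXT", "AK,F,1910,Mary,14\n")
    archive.writestr("AL.TXT", "AL,M,1910,John,200\nAL,F,1910,Mary,300\n")
    archive.writestr("StateReadMe.pdf", "not a data file")
buffer.seek(0)

demo = read_names_zip(buffer)
print(demo)
```

With the real file in the working directory, the call would be `read_names_zip("namesbystate.zip")`.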

Part 1: Popularity of the name "Matthew"

To start, we will look at the popularity of the common name "Matthew." Before we can visualize the data, we need to preprocess. To do so, a new dataframe is made to hold the count of the name Matthew for each year and state. Then, the number of Matthews will be divided by the total number of babies to determine the percentage of babies named Matthew.

In [5]:
# Create a dataframe of just Matthews
matthew = df[df['name']=="Matthew"]
#Create a dataframe holding the total number of babies for each state and year
n_babies = df.groupby(by=["year","state"])["count"].sum()
n_babies_year = df.groupby(by=["year"])["count"].sum()
n_babies
Out[5]:
year  state
1910  AK         115
      AL       19694
      AR       11041
      AZ        1074
      CA        9163
               ...  
2019  VT        2481
      WA       63447
      WI       46472
      WV       12767
      WY        2362
Name: count, Length: 5610, dtype: int64
In [6]:
#Total the Matthews for the entire country for each year and find the percentage
matthew_total = matthew.groupby(by=["year"])['count'].sum()
matthew_ratio = (matthew_total/n_babies_year)*100
matthew_ratio.reset_index()
Out[6]:
year count
0 1910 0.032538
1 1911 0.031105
2 1912 0.040203
3 1913 0.044430
4 1914 0.044536
... ... ...
105 2015 0.407146
106 2016 0.407774
107 2017 0.388453
108 2018 0.338176
109 2019 0.318109

110 rows × 2 columns

In [7]:
px.line(matthew_ratio, x = matthew_ratio.index, y="count", title="Percentage of babies named 'Matthew' by year", labels={"count": "Percentage named Matthew"})

In the plot above, we see that the name Matthew saw a rise in popularity after 1950 and peaked in 1983 at 1.6%. It has been on a steady decline ever since.
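
The peak year and value can also be read off the series programmatically rather than from the plot. Below is a minimal sketch on a made-up series; the numbers are illustrative, not the SSA values, and `peak` is a hypothetical helper.

```python
import pandas as pd

def peak(series):
    """Return the index label and the value of a Series' maximum."""
    return series.idxmax(), series.max()

# Made-up percentages indexed by year
toy_ratio = pd.Series([0.5, 1.2, 1.6, 1.1], index=[1970, 1980, 1983, 1990], name="count")

peak_year, peak_value = peak(toy_ratio)
print(peak_year, peak_value)
```

Applied to the data in this report, the call would be `peak(matthew_ratio)`.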

Now, we will evaluate the popularity of the name on a state-by-state basis. To do so, we will now group the Matthew dataframe by year and state and then find the frequency of the name by state.

In [8]:
# Separate again by state.
matthew_total_state = matthew.groupby(by=["year","state"])['count'].sum()
# Find the percentage of Matthews for each state
matthew_ratio_state = (matthew_total_state/n_babies)*100
matthew_ratio_state
Out[8]:
year  state
1910  AK            NaN
      AL            NaN
      AR            NaN
      AZ            NaN
      CA       0.065481
                 ...   
2019  VT       0.443370
      WA       0.250603
      WI       0.187210
      WV       0.250646
      WY       0.465707
Name: count, Length: 5610, dtype: float64
In [9]:
matthew_ratio_state = matthew_ratio_state.reset_index()
In [10]:
matthew_ratio_state
Out[10]:
year state count
0 1910 AK NaN
1 1910 AL NaN
2 1910 AR NaN
3 1910 AZ NaN
4 1910 CA 0.065481
... ... ... ...
5605 2019 VT 0.443370
5606 2019 WA 0.250603
5607 2019 WI 0.187210
5608 2019 WV 0.250646
5609 2019 WY 0.465707

5610 rows × 3 columns
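
The NaN entries above appear because pandas aligns the two series on their (year, state) index before dividing, and a state-year with no recorded Matthews has no row in the numerator. Below is a minimal sketch on made-up rows, assuming it is acceptable to treat those gaps as 0. That is an assumption: the SSA omits names with fewer than 5 occurrences, so the true value may be small but nonzero.

```python
import pandas as pd

# Made-up rows: AK has no "Matthew" entry, CA does
toy = pd.DataFrame({
    "state": ["AK", "AK", "CA", "CA"],
    "sex": ["M", "F", "M", "M"],
    "year": [1910, 1910, 1910, 1910],
    "name": ["John", "Mary", "John", "Matthew"],
    "count": [8, 7, 90, 10],
})

totals = toy.groupby(["year", "state"])["count"].sum()
matthews = toy[toy["name"] == "Matthew"].groupby(["year", "state"])["count"].sum()

# Index alignment leaves NaN where the numerator has no row
ratio = matthews / totals * 100

# Assumption: a missing row means zero recorded Matthews
ratio_filled = matthews.reindex(totals.index, fill_value=0) / totals * 100
print(ratio_filled)
```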

To illustrate the popularity of the name Matthew on a state-by-state basis, a choropleth map will be used. This map can be animated such that the data for each year between 1910 and 2019 can be visualized. The darker the green color, the more popular the name "Matthew" is in that state.

In [11]:
fig = px.choropleth(matthew_ratio_state, 
                    locationmode="USA-states",
                    scope="usa",
                    locations="state",
                    color="count",
                    title="Percentage of babies named Matthew",
                    color_continuous_scale = "greens",
                    range_color=(0, 2),
                    animation_frame="year",
                    hover_name="count",
                    labels={"count": "Percentage of babies named Matthew"},
                    hover_data = {"state":True}
                   )



fig.update_traces(marker_line_color="white")
fig.show()

Looking at the change over time in the map, the name's rise after 1950 appears to have begun in the northern states before spreading to the south. When the name's popularity started to decrease after 1983, there was no clear regional pattern.

Part 2: Analyzing name diversity

Similar to the analysis of the name Matthew, we will now analyze how the number of different names has changed over time. First, we will see how many different names have been used across the whole country by year. Then, we will use a choropleth map to see which state used the highest number of different names per year.

In [12]:
#Preview the data without the sex column; drop() returns a copy, so df itself is unchanged
df.drop("sex", axis=1)
Out[12]:
state year name count
0 IN 1910 Mary 619
1 IN 1910 Helen 324
2 IN 1910 Ruth 238
3 IN 1910 Dorothy 215
4 IN 1910 Mildred 200
... ... ... ... ...
6122885 DE 2019 River 5
6122886 DE 2019 Rocco 5
6122887 DE 2019 Shane 5
6122888 DE 2019 Syncere 5
6122889 DE 2019 Yasir 5

6122890 rows × 4 columns

In [13]:
unique_total = df.groupby(by=["year"])['name'].nunique()
unique_total.reset_index()
Out[13]:
year name
0 1910 1693
1 1911 1740
2 1912 2261
3 1913 2476
4 1914 2863
... ... ...
105 2015 9577
106 2016 9327
107 2017 9171
108 2018 9124
109 2019 8957

110 rows × 2 columns

In [14]:
#Plot unique names
px.line(unique_total, x = unique_total.index, y="name", labels={"name":"Number of different names used"},title="The number of different names given to babies in the US (1910-2019)")

In the plot above, the general trend shows that name diversity in the US has increased over time, although it has declined since the peak of 10,023 in 2007. Note that the actual number of different names is higher than shown in this graph (and in the subsequent name-diversity plots in Part 2) because a name is only recorded if at least 5 babies were given it in a given state and year.

Next, we will analyze the breakdown by state.

In [15]:
# Group data by year and state
unique_total_state = df.groupby(by=["year","state"])['name'].nunique()
unique_total_state = unique_total_state.reset_index()
unique_total_state
Out[15]:
year state name
0 1910 AK 16
1 1910 AL 596
2 1910 AR 466
3 1910 AZ 103
4 1910 CA 360
... ... ... ...
5605 2019 VT 241
5606 2019 WA 2181
5607 2019 WI 1796
5608 2019 WV 749
5609 2019 WY 252

5610 rows × 3 columns

Here, a choropleth map will be used to indicate the number of different names used in each state for each year. The darker the purple color, the higher the number of different names in a given state.

In [16]:
fig = px.choropleth(unique_total_state, 
                    locationmode="USA-states",
                    scope="usa",
                    locations="state",
                    color="name",
                    title="Number of different names used by state",
                    color_continuous_scale = "purples",
                    range_color=(0, 7000),
                    animation_frame="year",
                    hover_name="state",
                    labels={"name": "Different baby names"},
                    hover_data = {"state":True}
                   )

 

fig.update_traces(marker_line_color="white")
fig.show()

Using the choropleth, we see the same trend of increased diversity over time. It is also apparent that New York, California, Texas, and Florida have had the most name diversity over the past few decades. This can likely be attributed to the fact that these are the most populous states and are among the most ethnically diverse.
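
One way to check how much of this pattern is explained by population alone would be to scale the number of different names by the number of recorded births. Below is a minimal sketch on made-up counts. The column names `unique_names`, `births`, and `names_per_1000_births` are assumptions introduced here, and this normalization is not part of the analysis in this report.

```python
import pandas as pd

# Made-up rows: a large state and a small state in the same year
toy = pd.DataFrame({
    "state": ["CA", "CA", "CA", "VT", "VT"],
    "year": [2019] * 5,
    "name": ["Liam", "Emma", "Noah", "Liam", "Emma"],
    "count": [400, 350, 250, 30, 20],
})

# Count distinct names and total recorded births per state and year
per_state = toy.groupby(["year", "state"]).agg(
    unique_names=("name", "nunique"),
    births=("count", "sum"),
)

# Distinct names per 1,000 recorded births
per_state["names_per_1000_births"] = per_state["unique_names"] * 1000 / per_state["births"]
print(per_state)
```

In this toy example the larger state uses more names in absolute terms but fewer per birth, which is why the raw counts mostly track population.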

Next, let's see if there's a significant difference in the number of different boy names and girl names.

In [17]:
#Group data by year and sex
unique_by_sex = df.groupby(by=["year","sex"])['name'].nunique()
In [18]:
unique_by_sex = unique_by_sex.reset_index()
unique_by_sex
Out[18]:
year sex name
0 1910 F 1083
1 1910 M 692
2 1911 F 1066
3 1911 M 754
4 1912 F 1261
... ... ... ...
215 2017 M 4310
216 2018 F 5377
217 2018 M 4297
218 2019 F 5280
219 2019 M 4252

220 rows × 3 columns

In [19]:
#Plot male and female data separately
px.line(unique_by_sex, x="year", y="name", color='sex', title="Male vs Female name diversity in US (1910-2019)", labels={"name":"Number of different names"})

The plot above illustrates that name diversity follows a similar trend for males and females, while females have consistently had a larger number of different names.
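
Note that the per-sex counts do not add up to the combined count from earlier (for 1910, 1083 + 692 = 1775 versus 1693), because a name given to both sexes is counted once in each group but only once overall. A minimal sketch on made-up rows illustrates this:

```python
import pandas as pd

# Made-up rows: "Leslie" is recorded for both sexes
toy = pd.DataFrame({
    "year": [1910, 1910, 1910, 1910],
    "sex": ["F", "F", "M", "M"],
    "name": ["Mary", "Leslie", "John", "Leslie"],
    "count": [100, 5, 90, 6],
})

combined = toy.groupby("year")["name"].nunique()
by_sex = toy.groupby(["year", "sex"])["name"].nunique()

# The shared name is counted once overall but once per sex
print(combined.loc[1910])
print(by_sex.loc[1910].sum())
```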

To conclude this analysis of name diversity, we will evaluate the breakdown by state and sex.

In [20]:
# Also group by sex now.
unique_by_sex_state = df.groupby(by=["year","state","sex"])['name'].nunique()
unique_by_sex_state.reset_index()
Out[20]:
year state sex name
0 1910 AK F 8
1 1910 AK M 8
2 1910 AL F 375
3 1910 AL M 240
4 1910 AR F 300
... ... ... ... ...
11215 2019 WI M 882
11216 2019 WV F 373
11217 2019 WV M 395
11218 2019 WY F 115
11219 2019 WY M 139

11220 rows × 4 columns

In [21]:
#Allow data to be grouped by sex
unique_by_state_grouped = unique_by_sex_state.groupby(by="sex")
In [22]:
#Make dataframe for girls
girls = unique_by_state_grouped.get_group('F')
girls = girls.reset_index()

The choropleth below shows the number of different female names per state for each year. The darker the pink color, the more different names were used.

In [23]:
#Create choropleth for female data
fig = px.choropleth(girls, 
                    locationmode="USA-states",
                    scope="usa",
                    locations="state",
                    color="name",
                    title="Number of different female names used by state",
                    color_continuous_scale = "magenta",
                    range_color=(0, 4000),
                    animation_frame="year",
                    hover_name="state",
                    labels={"name": "Different female baby names"},
                    hover_data = {"state":True}
                   )


fig.update_traces(marker_line_color="white")
fig.show()
In [24]:
#Create dataframe for boys
boys = unique_by_state_grouped.get_group('M')
boys = boys.reset_index()
boys
Out[24]:
year state sex name
0 1910 AK M 8
1 1910 AL M 240
2 1910 AR M 183
3 1910 AZ M 45
4 1910 CA M 130
... ... ... ... ...
5605 2019 VT M 139
5606 2019 WA M 1054
5607 2019 WI M 882
5608 2019 WV M 395
5609 2019 WY M 139

5610 rows × 4 columns

The choropleth below shows the number of different male names per state for each year. The darker the blue color, the more different names were used.

In [25]:
#Create choropleth for boys 
fig = px.choropleth(boys, 
                    locationmode="USA-states",
                    scope="usa",
                    locations="state",
                    color="name",
                    title="Number of different male names used by state",
                    color_continuous_scale = "blues",
                    range_color=(0, 4000),
                    animation_frame="year",
                    hover_name="state",
                    labels={"name": "Different male baby names"},
                    hover_data = {"state":True}
                   )



fig.update_traces(marker_line_color="white")
fig.show()

The choropleths, now divided by sex, mostly mirror the trends from the combined choropleth as the more populous states tend to have more names.

Part 3: The top name in each state per year

In this final section, choropleth maps will be used to illustrate the most popular name for each year in every state. Data will be divided between males and females such that the most popular name from each sex can be seen.

In [26]:
n_babies = df.groupby(by=["state","year","sex"])["count"].sum()
In [27]:
# Function that gets the most popular name
def top_name(grp):
    return grp.sort_values(by="count", ascending=False).head(1)
In [28]:
# Create dataframe for most popular male names
top_state_name = df.groupby(by="sex")

top_state_boys = top_state_name.get_group('M')

# Keep the highest-count row for each state and year
most_popular_boys = top_state_boys.groupby(by=["state","year", "sex"]).apply(top_name)
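
The same per-group maximum can also be selected without sorting each group, by looking up the row label of the largest count with `idxmax`. Below is a minimal sketch on made-up rows, offered as an alternative rather than the method used above. On ties, `idxmax` keeps the first row it encounters.

```python
import pandas as pd

# Made-up rows: two candidate names per state
toy = pd.DataFrame({
    "state": ["AK", "AK", "CA", "CA"],
    "sex": ["M", "M", "M", "M"],
    "year": [1910, 1910, 1910, 1910],
    "name": ["John", "James", "John", "Robert"],
    "count": [8, 7, 90, 120],
})

# Row label of the largest count within each (state, year, sex) group
top_idx = toy.groupby(["state", "year", "sex"])["count"].idxmax()

# Select those rows; the result keeps state, year, and sex as ordinary columns
top_names = toy.loc[top_idx].reset_index(drop=True)
print(top_names)
```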
In [29]:
#Make choropleth of most popular male names
fig = px.choropleth(most_popular_boys, 
                    locationmode="USA-states",
                    scope="usa",
                    locations="state",
                    color="name",
                    title="Most popular boy's name by state",
                    color_continuous_scale = "blues",
                    #range_color=(0, 0.1),
                    animation_frame="year",
                    hover_name="state",
                    labels={"name": "Top boy's name"},
                    hover_data = {"state":False, "year":False}
                   )



fig.update_traces(marker_line_color="white")
fig.show()

In the choropleth above, the five most popular names for each year are listed in the legend. If a state's most popular name is not amongst the nation's top five, the state will be colored gray. In the early 20th century, the names "John" and "Robert" were popular in most states. Toward the end of the century, names such as "Michael" or "Jacob" reached similar heights, but for shorter periods of time.

In [30]:
# Create dataframe for most popular female names
top_state_girls = top_state_name.get_group('F')

# Keep the highest-count row for each state and year
most_popular_girls = top_state_girls.groupby(by=["state","year", "sex"]).apply(top_name)
In [31]:
fig = px.choropleth(most_popular_girls, 
                    locationmode="USA-states",
                    scope="usa",
                    locations="state",
                    color="name",
                    title="Most popular girl's name by state",
                    color_continuous_scale = "blues",
                    range_color=(0, 0.1),
                    animation_frame="year",
                    hover_name="state",
                    labels={"name": "Top girl's name"},
                    hover_data = {"state":False}
                   )



fig.update_traces(marker_line_color="white")
fig.show()

In the choropleth above, the three most popular female names for each year are listed in the legend. If a state's most popular name is not amongst the nation's top three, the state will be colored gray. We can see that the name "Mary" was extremely popular across the country for much of the early 20th century. Later in the century, the names "Lisa" and "Jennifer" were the most popular in almost every state, albeit for a shorter period of time.
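
How widely a top name was shared across states can also be tallied numerically by counting, for each year, the number of states in which each name ranked first. Below is a minimal sketch on a made-up table of top names; the column name `n_states` is an assumption introduced here, and the rows are illustrative rather than SSA results.

```python
import pandas as pd

# Made-up table with one top name per state and year
toy_top = pd.DataFrame({
    "state": ["AK", "CA", "NY", "AK", "CA", "NY"],
    "year": [1910, 1910, 1910, 1975, 1975, 1975],
    "name": ["Mary", "Mary", "Mary", "Jennifer", "Jennifer", "Amy"],
})

# Number of states in which each name ranked first, per year
states_led = toy_top.groupby(["year", "name"]).size().reset_index(name="n_states")
print(states_led)
```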

Conclusion

Upon analyzing the data provided by the Social Security Administration, it is clear that the popularity of individual names has changed considerably over time. Americans have also become more varied in their naming: the number of different baby names rose for most of the past century, although it has declined since 2007. Thanks to pandas and Plotly Express, this data could be processed and presented in a digestible form without reading through millions of rows. Choropleth maps take the visualization further by breaking the data down geographically by state.
